Correlation

Correlation and Causation

Simple regression: Introduction

Motivation

Predicting the Price of a used car

Simple linear regression

Simple linear regression (population) \[Y=\beta_0+\beta_1 x+\epsilon\] In our example: \[Price=\beta_0+\beta_1 Age+\epsilon\]

Simple linear regression (sample) \[\hat{y}=b_0+b_1 x\] where the coefficient \(\beta_0\) (and its estimate \(b_0\) or \(\hat{\beta}_0\) ) refers to the \(y\)-intercept or simply the intercept or the constant of the regression line, and the coefficient \(\beta_1\) (and its estimate \(b_1\) or \(\hat{\beta}_1\) ) refers to the slope of the regression line.

Least-Squares criterion

  • The least-squares criterion is that the line that best fits a set of data points is the one having the smallest possible sum of squared errors. The 'errors' are the vertical distances of the data points to the line.

  • We need to use the data to estimate the values of the parameters \(\beta_0\) and \(\beta_1\), i.e. to fit a straight line to the set of points \(\{(x_i , y_i )\}\). There are many straight lines we could use, so we need some idea of which is best. Clearly, a bad straight line model would be one that had many large errors, and conversely, a good straight line model will have, on average, small errors. We quantify this by the sum of squares of the errors: \[Q(\beta_0,\beta_1)=\sum_{i=1}^n \epsilon_i^2=\sum_{i=1}^n[y_i-(\beta_0 + \beta_1 x_i)]^2\] then the “line of best fit” will correspond to the line with values of \(\beta_0\) and \(\beta_1\) that minimises \(Q(\beta_0,\beta_1)\).

  • The regression line is the line that fits a set of data points according to the least squares criterion.

  • The regression equation is the equation of the regression line.

  • The regression equation for a set of \(n\) data points is \(\hat{y}=b_0+b_1\;x\), where \[b_1=\frac{S_{xy}}{S_{xx}}=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2} \;\;\text{and}\;\; b_0=\bar{y}-b_1\; \bar{x}\]

  • \(y\) is the dependent variable (or response variable) and \(x\) is the independent variable (predictor variable or explanatory variable).

  • \(b_0\) is called the y-intercept and \(b_1\) is called the slope.

SSE and the standard error

This least-squares regression line minimizes the error sum of squares \[SSE=\sum e^2_i =\sum (y_i-\hat{y}_i)^2\] The standard error of the estimate, \(s_e=\sqrt{SSE/(n-2)}\), indicates how much, on average, the observed values of the response variable differ from the predicted values of the response variable.
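As a quick numerical check, the least-squares formulas and \(s_e\) above can be computed directly in base R. A minimal sketch, using the used-car data introduced in the example that follows:

```r
# Used-car data (Weiss, 2012): Price in hundreds of dollars, Age in years
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)

Sxy <- sum((Age - mean(Age)) * (Price - mean(Price)))
Sxx <- sum((Age - mean(Age))^2)

b1 <- Sxy / Sxx                         # slope
b0 <- mean(Price) - b1 * mean(Age)      # intercept

e   <- Price - (b0 + b1 * Age)          # residuals
SSE <- sum(e^2)
se  <- sqrt(SSE / (length(Price) - 2))  # standard error of the estimate

round(c(b0 = b0, b1 = b1, se = se), 2)
# b0 = 195.47, b1 = -20.26, se = 12.58
```

These values agree with the coefficients returned by lm(Price ~ Age) later in the notes.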

Example: used cars (cont.)

The table below displays data on Age (in years) and Price (in hundreds of dollars) for a sample of cars of a particular make and model (Weiss, 2012).

Price (\(y\)) Age (\(x\))
85 5
103 4
70 6
82 5
89 5
98 5
66 6
95 6
169 2
70 7
48 7
  • For our example, age is the predictor variable and price is the response variable.

  • The regression equation is \(\hat{y}=195.47-20.26\;x\), where the slope \(b_1=-20.26\) and the intercept \(b_0=195.47\)

  • Prediction: for \(x = 4\), that is we would like to predict the price of a 4-year-old car, \[\hat{y}=195.47-20.26 {\color{blue}(4)}= 114.43 \;\;\text{or}\;\; \$ 11443\]

Back to our used cars example, we want to find the “best line” through the data points, which can be used to predict prices of used cars based on their age.

First we need to enter the data in R.

Price<-c(85, 103,  70,  82,  89,  98,  66,  95, 169,  70,  48)
Age<- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
carSales<-data.frame(Price,Age)
str(carSales)
## 'data.frame':    11 obs. of  2 variables:
##  $ Price: num  85 103 70 82 89 98 66 95 169 70 ...
##  $ Age  : num  5 4 6 5 5 5 6 6 2 7 ...
cor(Age, Price, method = "pearson")
## [1] -0.9237821

Scatterplot: Age vs. Price

library(ggplot2)
ggplot(carSales, aes(x=Age, y=Price)) + geom_point()

# Remove the confidence interval
ggplot(carSales, aes(x=Age, y=Price)) + 
  geom_point()+
  geom_smooth(method=lm, formula= y~x, se=FALSE)

Prediction

# simple linear regression
reg<-lm(Price~Age)
print(reg)
## 
## Call:
## lm(formula = Price ~ Age)
## 
## Coefficients:
## (Intercept)          Age  
##      195.47       -20.26

To predict the price of a 4-year-old car (\(x=4\)): \[\hat{y}=195.47-20.26(4)=114.43\]

Simple Regression: Coefficient of Determination

Extrapolation

  • Within the range of the observed values of the predictor variable, we can reasonably use the regression equation to make predictions for the response variable.

  • However, to do so outside that range, which is called extrapolation, may not be reasonable, because the linear relationship between the predictor and response variables may not hold there.

  • To predict the price of an 11-year-old car: \(\hat{y}=195.47-20.26 (11)=-27.39\), i.e. \(-\$2739\). This result is unrealistic, as no one is going to pay us $2739 to take their 11-year-old car away.

Outliers and influential observations

  • Recall that an outlier is an observation that lies outside the overall pattern of the data. In the context of regression, an outlier is a data point that lies far from the regression line, relative to the other data points.

  • An influential observation is a data point whose removal causes the regression equation (and line) to change considerably.

  • From the scatterplot, it seems that the data point (2,169) might be an influential observation. Removing that data point and recalculating the regression equation yields \(\hat{y}=160.33-14.24\;x\).
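The effect of removing the point (2, 169) can be checked directly in R. A short sketch, refitting the model without that observation:

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)

keep <- !(Age == 2 & Price == 169)   # drop the suspected influential point
full    <- lm(Price ~ Age)
reduced <- lm(Price[keep] ~ Age[keep])

round(coef(full), 2)     # intercept 195.47, slope -20.26
round(coef(reduced), 2)  # intercept 160.33, slope -14.24
```

The slope changes from about \(-20.26\) to \(-14.24\), confirming that this single point is influential.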

Coefficient of determination

  • The total variation in the observed values of the response variable, \(SST=\sum(y_i-\bar{y})^2\), can be partitioned into two components:

    • The variation in the observed values of the response variable explained by the regression: \(SSR=\sum(\hat{y}_i-\bar{y})^2\)
    • The variation in the observed values of the response variable not explained by the regression: \(SSE=\sum(y_i-\hat{y}_i)^2\)
  • The coefficient of determination, \(R^2\) (or \(R\)-square), is the proportion of the variation in the observed values of the response variable explained by the regression, which is given by \[R^2=\frac{SSR}{SST}=\frac{SST-SSE}{SST}=1-\frac{SSE}{SST}\] where \(SST=SSR+SSE\). \(R^2\) is a descriptive measure of the utility of the regression equation for making predictions.

  • The coefficient of determination \(R^2\) always lies between 0 and 1. A value of \(R^2\) near 0 suggests that the regression equation is not very useful for making predictions, whereas a value of \(R^2\) near 1 suggests that the regression equation is quite useful for making predictions.

  • For a simple linear regression (one independent variable) ONLY, \(R^2\) is the square of Pearson correlation coefficient, \(r\).

  • \(\text{Adjusted}\;R^2\) is a modification of \(R^2\) which takes into account the number of independent variables, say \(k\). In a simple linear regression \(k=1\). Adjusted-\(R^2\) increases only when a significant related independent variable is added to the model. Adjusted-\(R^2\) has a crucial role in the process of model building. Adjusted-\(R^2\) is given by \[\text{Adjusted-}R^2=1-(1-R^2)\frac{n-1}{n-k-1}\]
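These definitions can be verified numerically for the used-car regression. A sketch in base R, computing \(R^2\) and Adjusted-\(R^2\) from the sums of squares:

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
reg   <- lm(Price ~ Age)

SST <- sum((Price - mean(Price))^2)   # total variation
SSE <- sum(resid(reg)^2)              # unexplained variation
R2  <- 1 - SSE / SST

n <- length(Price); k <- 1            # one independent variable
adjR2 <- 1 - (1 - R2) * (n - 1) / (n - k - 1)

round(c(R2 = R2, r.squared = cor(Age, Price)^2, adjR2 = adjR2), 4)
# R2 = 0.8534 = r^2, and adjusted R^2 = 0.8371
```

Note that \(R^2\) equals the square of the Pearson correlation, as it must for a simple linear regression.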

Notation used in regression

Quantity Defining formula Computing formula
\(S_{xx}\) \(\sum (x_i-\bar{x})^2\) \(\sum x^2_i - n \bar{x}^2\)
\(S_{xy}\) \(\sum (x_i-\bar{x})(y_i-\bar{y})\) \(\sum x_i y_i - n \bar{x}\bar{y}\)
\(S_{yy}\) \(\sum (y_i-\bar{y})^2\) \(\sum y^2_i - n \bar{y}^2\)

where \(\bar{x}=\frac{\sum x_i}{n}\) and \(\bar{y}=\frac{\sum y_i}{n}\). And,

\[SST=S_{yy},\;\;\; SSR=\frac{S^2_{xy}}{S_{xx}},\;\;\; SSE=S_{yy}-\frac{S^2_{xy}}{S_{xx}} \] and \(SST=SSR+SSE\).

Pearson correlation coefficient

Pearson correlation coefficient (\(r\)) is a measure of the strength and the direction of a linear relationship between two variables in the sample,

\[r=\frac{\sum(x_{i} -\bar{x})(y_{i} -\bar{y}) }{\sqrt{\sum (x_{i} -\bar{x})^{2} \sum (y_{i} -\bar{y})^{2} } } \]

where \(r\) always lies between -1 and 1. Values of \(r\) near -1 or 1 indicate a strong linear relationship between the variables whereas values of \(r\) near 0 indicate a weak linear relationship between variables. If \(r\) is zero the variables are linearly uncorrelated, that is there is no linear relationship between the two variables.

Hypothesis testing for the population correlation coefficient \(\rho\)


Assumptions:

  • The sample of paired \((x, y)\) data is a random sample.
  • The pairs of \((x, y)\) data have a bivariate normal distribution.

The null hypothesis

\(H_0: \rho = 0\) (no significant correlation)

against one of the alternative hypotheses:

  • \(H_1: \rho \neq 0\) (significant correlation) "Two-tailed test"

  • \(H_1: \rho < 0\) (significant negative correlation) "Left-tailed test"

  • \(H_1: \rho > 0\) (significant positive correlation) "Right-tailed test"

Compute the value of the test statistic: \[t=\frac{r\; \sqrt{n-2} }{\sqrt{1-r^{2} } }\sim T_{(n-2)} \;\; \text{with}\;\; df = n-2. \]

where \(n\) is the sample size.

The critical value(s) for this test can be found from the t-distribution table (\(\pm t_{\alpha/2}\) for a two-tailed test, \(- t_{\alpha}\) for a left-tailed test, and \(t_{\alpha}\) for a right-tailed test), with \(df=n-2\).

  • If the value of the test statistic falls in the rejection region, then reject \(H_0\); otherwise, do not reject \(H_0\).
  • Statistical packages report p-values rather than critical values which can be used in testing the null hypothesis \(H_0\).
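The test statistic is easy to compute directly from \(r\). A minimal sketch for the used-car data (using full precision for \(r\) rather than the rounded \(-0.924\)):

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)

r <- cor(Age, Price)                     # -0.9237821
n <- length(Age)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_left <- pt(t_stat, df = n - 2)         # left-tailed p-value

round(c(t = t_stat, p = p_left), 5)
# t = -7.2374, matching the cor.test() output shown below
```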

Correlation and linear transformation

  • Suppose we have a linear transformation of the two variables \(x\) and \(y\), say \(x_1=a x+b\) and \(y_1=cy+d\) where \(a>0\) and \(c>0\). Then the Pearson correlation coefficient between \(x_1\) and \(y_1\) is equal to Pearson correlation coefficient between \(x\) and \(y\).

  • For our example, suppose we convert cars’ prices from dollars to pounds (say \(\$1=£0.75\), so \(y_1=0.75\,y\)), and leave the age of the cars unchanged. Then we will find that the correlation between the age of the car and its price in pounds is equal to the one we obtained before (i.e. the correlation between the age and the price in dollars).

  • A special linear transformation is to standardize one or both variables. That is obtaining the values \(z_x=(x-\bar{x})/s_x\) and \(z_y=(y-\bar{y})/s_y\). Then the correlation between \(z_x\) and \(z_y\) is equal to the correlation between \(x\) and \(y\).

Spearman’s rho correlation coefficient (\(r_s\))

  • When the normality assumption for the Pearson correlation coefficient \(r\) cannot be met, or when one or both variables may be ordinal, then we should consider nonparametric methods such as Spearman’s rho and Kendall’s tau correlation coefficients.

  • Spearman’s rho correlation coefficient, \(r_s\), can be obtained by first ranking the \(x\) values (and the \(y\) values) among themselves, and then computing the Pearson correlation coefficient of the rank pairs. Like \(r\), it satisfies \(-1\leq r_s \leq 1\).

  • Spearman’s rho correlation coefficient can be used to describe the strength of a monotonic relationship, whether linear or nonlinear.
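The rank-based definition can be checked directly: Pearson's \(r\) computed on the ranks reproduces R's built-in Spearman coefficient (ties receive average ranks). A small sketch with the used-car data:

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)

r_s <- cor(rank(Age), rank(Price))             # Pearson r of the rank pairs
r_s_builtin <- cor(Age, Price, method = "spearman")

round(c(manual = r_s, builtin = r_s_builtin), 4)
# both about -0.833
```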

Kendall’s tau (\(\tau\)) correlation coefficient

  • Kendall’s tau, \(\tau\), measures the concordance of the relationship between two variables, and \(-1\leq \tau \leq 1\).

  • A pair of observations \((x_i, y_i)\) and \((x_j, y_j)\) is said to be concordant if both \(x_i > x_j\) and \(y_i > y_j\), or both \(x_i < x_j\) and \(y_i < y_j\); it is said to be discordant if \(x_i > x_j\) and \(y_i < y_j\), or \(x_i < x_j\) and \(y_i > y_j\). There are \(n(n-1)/2\) pairs to compare.

  • The Kendall’s tau (\(\tau\)) correlation coefficient is defined as: \[\tau=\frac{\text{number of concordant pairs} - \text{number of discordant pairs}}{n(n-1)/2}\]
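The defining formula can be implemented by counting concordant and discordant pairs directly. A minimal sketch on a small tie-free sample (with ties, as in the used-car data, R applies a tie correction, so this simple formula would not match exactly):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)
n <- length(x)

concordant <- 0
discordant <- 0
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    s <- sign(x[j] - x[i]) * sign(y[j] - y[i])
    if (s > 0) concordant <- concordant + 1   # same ordering in x and y
    if (s < 0) discordant <- discordant + 1   # opposite ordering
  }
}

tau <- (concordant - discordant) / (n * (n - 1) / 2)
c(tau, cor(x, y, method = "kendall"))   # both 0.6
```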

Example: used cars

The table below displays data on Age (in years) and Price (in hundreds of dollars) for a sample of cars of a particular make and model (Weiss, 2012).

Price (\(y\)) Age (\(x\))
85 5
103 4
70 6
82 5
89 5
98 5
66 6
95 6
169 2
70 7
48 7
  • The Pearson correlation coefficient, \[r=\frac{\sum x_i y_i - (\sum x_i) (\sum y_i)/n }{\sqrt{ [\sum x^2_{i} -(\sum x_i)^2/n] [\sum y^2_{i} -(\sum y_i)^2/n] } } \] \[r=\frac{4732-(58)(975)/11}{\sqrt{(326-58^2/11)( 96129-975^2/11)}}=-0.924\] the value of \(r=-0.924\) suggests a strong negative linear correlation between age and price.

  • Test the hypothesis \(H_0: \rho = 0\) (no linear correlation) against \(H_1: \rho < 0\) (negative correlation) at significance level \(\alpha=0.05\).

Compute the value of the test statistic: \[t=\frac{r\; \sqrt{n-2} }{\sqrt{1-r^{2} } }=\frac{-0.924\sqrt{11-2}}{\sqrt{1-(-0.924)^2}}=-7.249 \]

Since \(t=-7.249 < -t_{0.05}=-1.833\) (with \(df=9\)), we reject \(H_0\).

Using R:

First we need to enter the data in R.

Price<-c(85, 103,  70,  82,  89,  98,  66,  95, 169,  70,  48)
Age<- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
carSales<-data.frame(Price,Age)
str(carSales)
## 'data.frame':    11 obs. of  2 variables:
##  $ Price: num  85 103 70 82 89 98 66 95 169 70 ...
##  $ Age  : num  5 4 6 5 5 5 6 6 2 7 ...

Now let us plot age against price, i.e. a scatterplot.

plot(Price ~ Age, pch=16, col=2)

or we can use ggplot2 for a much nicer plot.

library(ggplot2)
# Basic scatter plot
ggplot(carSales, aes(x=Age, y=Price)) + geom_point()

From this plot it seems that there is a negative linear relationship between age and price. There are several tools that can help us to measure this relationship more precisely.

cor.test(Age, Price,
         alternative = "less",
         method = "pearson", conf.level = 0.95)
## 
##  Pearson's product-moment correlation
## 
## data:  Age and Price
## t = -7.2374, df = 9, p-value = 2.441e-05
## alternative hypothesis: true correlation is less than 0
## 95 percent confidence interval:
##  -1.0000000 -0.7749819
## sample estimates:
##        cor 
## -0.9237821

Suppose now we standardize (scale) both variables.

cor.test(scale(Age), scale(Price),
         alternative = "less",
         method = "pearson", conf.level = 0.95)
## 
##  Pearson's product-moment correlation
## 
## data:  scale(Age) and scale(Price)
## t = -7.2374, df = 9, p-value = 2.441e-05
## alternative hypothesis: true correlation is less than 0
## 95 percent confidence interval:
##  -1.0000000 -0.7749819
## sample estimates:
##        cor 
## -0.9237821

We notice that standardizing both variables leaves the correlation unchanged, just as corr(age, price in pounds) \(=\) corr(age, price in dollars).


We can also obtain Spearman’s rho and Kendall’s tau as follows.

cor.test(Age, Price,
         alternative = "less",
         method = "spearman", conf.level = 0.95)
## 
##  Spearman's rank correlation rho
## 
## data:  Age and Price
## S = 403.26, p-value = 0.0007267
## alternative hypothesis: true rho is less than 0
## sample estimates:
##        rho 
## -0.8330014
cor.test(Age, Price,
         alternative = "less",
         method = "kendall", conf.level = 0.95)
## 
##  Kendall's rank correlation tau
## 
## data:  Age and Price
## z = -2.9311, p-value = 0.001689
## alternative hypothesis: true tau is less than 0
## sample estimates:
##        tau 
## -0.7302967


As the p-values for all three tests (Pearson, Spearman, Kendall) are less than \(\alpha=0.05\), we reject the null hypothesis of no correlation between the age and the price, at the 5% significance level.


Now what do you think about correlation and causation?


Simple Linear Regression: Assumptions

Recall that the simple linear regression model for \(Y\) on \(x\) is \[Y=\beta_0+\beta_1 x+\epsilon\] where

\(Y\) : the dependent or response variable

\(x\) : the independent or predictor variable, assumed known

\(\beta_0,\beta_1\) : the regression parameters, the intercept and slope of the regression line

\(\epsilon\) : the random regression error around the line.

and the regression equation for a set of \(n\) data points is \(\hat{y}=b_0+b_1\;x\), where \[b_1=\frac{S_{xy}}{S_{xx}}=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2}\] and \[b_0=\bar{y}-b_1\; \bar{x}\] where \(b_0\) is called the y-intercept and \(b_1\) is called the slope.

The residual standard error \(s_e\) can be defined as

\[s_e=\sqrt{\frac{SSE}{n-2}}=\sqrt{\frac{\sum(y_i-\hat{y}_i)^2}{n-2}} \] \(s_e\) indicates how much, on average, the observed values of the response variable differ from the predicted values of the response variable.

Simple Linear Regression Assumptions (SLR)

We have a collection of \(n\) pairs of observations \(\{(x_i,y_i)\}\), and the idea is to use them to estimate the unknown parameters \(\beta_0\) and \(\beta_1\). \[\epsilon_i=Y_i-(\beta_0+\beta_1\;x_i)\;,\;\;i=1,2,\ldots,n\]

We need to make the following key assumptions on the errors:

A. \(E(\epsilon_i)=0\) (errors have mean zero and do not depend on \(x\))

B. \(Var(\epsilon_i)=\sigma^2\) (errors have a constant variance, homoscedastic, and do not depend on \(x\))

C. \(\epsilon_1, \epsilon_2,\ldots \epsilon_n\) are independent.

D. \(\epsilon_i \mbox{ are all i.i.d. } N(0, \;\sigma^2)\), meaning that the errors are independent and identically distributed as Normal with mean zero and constant variance \(\sigma^2\).

Given the parameters \(\beta_0\) and \(\beta_1\), the above assumptions imply:

  1. Linearity: \(E(Y_i|X_i)=\beta_0+\beta_1\;x_i\)

  2. Homogeneity or homoscedasticity: \(Var(Y_i|X_i)=\sigma^2\)

  3. Independence: \(Y_1,Y_2,\ldots,Y_n\) are all independent given \(X_i\).

  4. Normality: \(Y_i|X_i\sim N(\beta_0+\beta_1x_i,\;\sigma^2)\)
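These implications can be illustrated by simulation: generate data that satisfies assumptions A-D and check that lm() recovers the parameters. A hedged sketch (the true values \(\beta_0=2\), \(\beta_1=3\), \(\sigma=1\) are chosen arbitrarily for illustration):

```r
set.seed(1)   # for reproducibility
n <- 50
x <- 1:n

# Y_i | x_i ~ N(beta0 + beta1 * x_i, sigma^2), with i.i.d. Normal errors
y <- 2 + 3 * x + rnorm(n, mean = 0, sd = 1)

fit <- lm(y ~ x)
round(coef(fit), 2)   # estimates close to the true values (2, 3)
```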

Example: used cars (cont.)

We can see that for each age, the mean price of all cars of that age lies on the regression line \(E(Y|x)=\beta_0+\beta_1x\). And, the prices of all cars of that age are assumed to be normally distributed with mean equal to \(\beta_0+\beta_1 x\) and variance \(\sigma^2\). For example, the prices of all 4-year-old cars must be normally distributed with mean \(\beta_0+\beta_1 (4)\) and variance \(\sigma^2\).

We used the least-squares method to find the best fit for this data set, and the residuals can be obtained as \(e_i=y_i-\hat{y}_i= y_i-(195.47 -20.26x_i)\).

Residual Analysis

The easiest way to check the simple linear regression assumptions is by constructing a scatterplot of the residuals (\(e_i=y_i-\hat{y_i}\)) against the fitted values (\(\hat{y_i}\)) or against \(x\). If the model is a good fit, then the residual plot should show an even and random scatter of the residuals.
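For the used-car regression this residual plot can be produced in a couple of lines of base R; a minimal sketch:

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
reg <- lm(Price ~ Age)

# residuals against fitted values; a good fit shows an even, random scatter
plot(fitted(reg), resid(reg), pch = 16,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

With an intercept in the model, the least-squares residuals always sum to zero, so the scatter is centred on the dashed line.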

Linearity

The regression needs to be linear in the parameters.

\[Y=\beta_0+\beta_1\;x+\epsilon\] \[E(Y_i|X_i)=\beta_0+\beta_1\;x_i,\;\;\text{equivalently}\;\;E(\epsilon_i|X_i)=E(\epsilon_i)=0\]

The residual plot below shows that a linear regression model is not appropriate for this data.

Constant error variance (homoscedasticity)

The plot shows that the spread of the residuals increases as the fitted values (or \(x\)) increase, which indicates heteroskedasticity (non-constant variance). The standard errors are biased when heteroskedasticity is present, but the regression coefficients will still be unbiased.

How to detect?

  • Residuals plot (fitted vs residuals)

  • Goldfeld–Quandt test

  • Breusch-Pagan test

How to fix?

  • White’s standard errors

  • Weighted least squares model

  • Taking the log
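As an illustration, the (studentized) Breusch-Pagan statistic can be sketched in base R without extra packages: regress the squared residuals on the predictor and compare \(nR^2\) with a \(\chi^2_1\) distribution (in practice, bptest() from the lmtest package automates this). Applied, purely for illustration, to the used-car regression:

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
reg <- lm(Price ~ Age)

# auxiliary regression of the squared residuals on the predictor
aux <- lm(resid(reg)^2 ~ Age)
bp  <- length(Age) * summary(aux)$r.squared       # n * R^2, with df = 1
p   <- pchisq(bp, df = 1, lower.tail = FALSE)

c(statistic = bp, p.value = p)
# a small p-value would indicate heteroskedasticity
```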

Independent errors terms (no autocorrelation)

The problem of autocorrelation is most likely to occur in time series data; however, it can also occur in cross-sectional data, e.g. if the model is incorrectly specified. When autocorrelation is present, the regression coefficients will still be unbiased, but the standard errors and test statistics are no longer valid.

An example of no autocorrelation

An example of positive autocorrelation

An example of negative autocorrelation

How to detect?

  • Residuals plot

  • Durbin-Watson test

  • Breusch-Godfrey test

How to fix?

  • Investigate omitted variables (e.g. trend, business cycle)

  • Use advanced models (e.g. AR model)
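The Durbin-Watson statistic is simple to compute from the residuals (dwtest() in the lmtest package also supplies a p-value). A sketch, noting that the statistic is only really meaningful when the observations have a natural ordering, as in time series:

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
reg <- lm(Price ~ Age)
e <- resid(reg)

# DW = sum of squared successive differences / residual sum of squares.
# Values near 2 suggest no autocorrelation; near 0 positive; near 4 negative.
dw <- sum(diff(e)^2) / sum(e^2)
round(dw, 2)
```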

Normality of the errors

We need the errors to be normally distributed. Normality is only required for the sampling distributions, hypothesis testing and confidence intervals.

How to detect?

  • Histogram of residuals

  • Q-Q plot of residuals

  • Kolmogorov–Smirnov test

  • Shapiro–Wilk test

How to fix?

  • Change the functional form (e.g. taking the log)

  • Larger sample if possible
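The graphical checks and the Shapiro-Wilk test are all available in base R; a minimal sketch for the used-car regression residuals:

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
reg <- lm(Price ~ Age)

qqnorm(resid(reg)); qqline(resid(reg))     # Q-Q plot of the residuals
hist(resid(reg), main = "", xlab = "Residuals")

shapiro.test(resid(reg))   # H0: the residuals are normally distributed
```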

Example: Infant mortality and GDP

Let us investigate the relationship between infant mortality and the wealth of a country. We will use data on 207 countries of the world gathered by the UN in 1998 (the ‘UN’ data set is available from the R package ‘car’). The data set contains two variables: the infant mortality rate in deaths per 1000 live births, and the GDP per capita in US dollars. There are some missing data values for some countries, so we will remove the missing data before we fit our model.

# install.packages("carData")
library(carData)
data(UN)
options(scipen=999)
# Remove missing data
newUN<-na.omit(UN) 
str(newUN)
## 'data.frame':    193 obs. of  7 variables:
##  $ region         : Factor w/ 8 levels "Africa","Asia",..: 2 4 1 1 5 2 3 8 4 2 ...
##  $ group          : Factor w/ 3 levels "oecd","other",..: 2 2 3 3 2 2 2 1 1 2 ...
##  $ fertility      : num  5.97 1.52 2.14 5.13 2.17 ...
##  $ ppgdp          : num  499 3677 4473 4322 9162 ...
##  $ lifeExpF       : num  49.5 80.4 75 53.2 79.9 ...
##  $ pctUrban       : num  23 53 67 59 93 64 47 89 68 52 ...
##  $ infantMortality: num  124.5 16.6 21.5 96.2 12.3 ...
##  - attr(*, "na.action")= 'omit' Named int [1:20] 4 6 21 35 38 54 67 75 77 78 ...
##   ..- attr(*, "names")= chr [1:20] "American Samoa" "Anguilla" "Bermuda" "Cayman Islands" ...
fit<- lm(infantMortality ~ ppgdp, data=newUN)
summary(fit)
## 
## Call:
## lm(formula = infantMortality ~ ppgdp, data = newUN)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31.48 -18.65  -8.59  10.86  83.59 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 41.3780016  2.2157454  18.675 < 0.0000000000000002 ***
## ppgdp       -0.0008656  0.0001041  -8.312   0.0000000000000173 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.13 on 191 degrees of freedom
## Multiple R-squared:  0.2656, Adjusted R-squared:  0.2618 
## F-statistic: 69.08 on 1 and 191 DF,  p-value: 0.0000000000000173
plot(newUN$infantMortality ~ newUN$ppgdp, xlab="GDP per Capita", ylab="Infant mortality (per 1000 births)", pch=16, col="cornflowerblue")
abline(fit,col="red")

We can see from the scatterplot that the relationship between the two variables is not linear. There is a concentration of data points at small values of GDP (many poor countries) and a concentration of data points at small values of infant mortality (many countries with very low mortality). This suggests a skewness to both variables which would not conform to the normality assumption. Indeed, the regression line (the red line) we construct is a poor fit and only has an \(R^2\) of 0.266.

From the residual plot below we can see a clear evidence of structure to the residuals suggesting the linear relationship is a poor description of the data, and substantial changes in spread suggesting the assumption of homogeneous variance is not appropriate.

# diagnostic plots 
plot(fit,which=1,pch=16,col="cornflowerblue")

So we can apply a transformation to one or both variables, e.g. taking the log or adding a quadratic term. Notice that this does not violate the linearity assumption, as the regression will still be linear in the parameters. Taking the logs of both variables gives us the scatterplot of the transformed data set, below, which appears to show a more promising linear structure. The quality of the regression is now much improved, with an \(R^2\) value of 0.766, although the fit is still a little weak due to the rather large spread in the data.

fit1<- lm(log(infantMortality) ~ log(ppgdp), data=newUN)
summary(fit1)
## 
## Call:
## lm(formula = log(infantMortality) ~ log(ppgdp), data = newUN)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.16789 -0.36738 -0.02351  0.24544  2.43503 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  8.10377    0.21087   38.43 <0.0000000000000002 ***
## log(ppgdp)  -0.61680    0.02465  -25.02 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5281 on 191 degrees of freedom
## Multiple R-squared:  0.7662, Adjusted R-squared:  0.765 
## F-statistic: 625.9 on 1 and 191 DF,  p-value: < 0.00000000000000022
plot(log(newUN$infantMortality) ~ log(newUN$ppgdp), xlab="GDP per Capita", ylab="Infant mortality (per 1000 births)", pch=16, col="cornflowerblue")
abline(fit1,col="red")

So we check the residuals again. As we can see from the residuals plot below, the log transformation has corrected many of the problems with the residual plot, and the residuals are now much closer to the expected random scatter.

# diagnostic plots 
plot(fit1,which=1,pch=16,col="cornflowerblue")

Now let us check the Normality of the errors by creating a histogram and normal QQ plot for the residuals, before and after the log transformation. The normal quantile (QQ) plot shows the sample quantiles of the residuals against the theoretical quantiles that we would expect if these values were drawn from a Normal distribution. If the Normal assumption holds, then we would see an approximate straight-line relationship on the Normal quantile plot.

par(mfrow=c(2,2))
# before the log  transformation.
plot(fit, which = 2,pch=16, col="cornflowerblue")
hist(resid(fit),col="cornflowerblue",main="")
# after the log  transformation.
plot(fit1, which = 2, pch=16, col="hotpink3")  
hist(resid(fit1),col="hotpink3",main="")

The normal quantile plot and the histogram of residuals (before the log transformation) show strong departures from the expected straight line, with curvature in the tails reflecting the skewness of the data. After the transformation, the normal quantile plot and the histogram suggest that the residuals are much closer to Normality, with only some minor deviations in the tails.